Add per-module memory regression baseline check to CI Host #4430

Draft
lukemelia wants to merge 3 commits into main from memprobe-per-file-experiment

@lukemelia
Contributor

Summary

  • Add per-module memory probes (MEMPROBE_FILE) to host tests via QUnit suiteStart/suiteEnd hooks, logging heap usage and delta for each top-level test module
  • Add __shard_warmup__ synthetic module that runs first on every shard, absorbing ~36MB of shared boot cost so real test modules report clean per-file deltas
  • Add CI pipeline to extract per-shard memory reports, compare against a committed baseline (memory-baseline.json), and flag regressions with tiered thresholds:
    • Warn: increase exceeds the greater of 10% of the module's baseline or 5MB
    • Fail: increase exceeds the greater of 100% of the module's baseline or 50MB
  • Auto-update baseline on main merge when all tests pass
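
One way to read the tiered thresholds above, as a minimal sketch rather than the code in check-memory-baseline.mjs. The constant names appear in that script's summary output; their values are taken from this description, and `classifyDelta` is a hypothetical helper that assumes "whichever is greater" means the allowed increase is the larger of the relative and absolute limits.

```javascript
// Values follow the PR description; the names match those used in
// check-memory-baseline.mjs, but this function itself is an assumption.
const SOFT_RELATIVE = 0.1; // 10% of baseline
const SOFT_ABSOLUTE_MB = 5;
const HARD_RELATIVE = 1.0; // 100% of baseline
const HARD_ABSOLUTE_MB = 50;

// Hypothetical helper: classify one module's heap delta against its baseline.
function classifyDelta(baselineMB, currentMB) {
  const diff = currentMB - baselineMB;
  // "Whichever is greater": the tolerated increase is the larger of the
  // absolute limit and the relative limit for this module.
  const softLimit = Math.max(SOFT_ABSOLUTE_MB, baselineMB * SOFT_RELATIVE);
  const hardLimit = Math.max(HARD_ABSOLUTE_MB, baselineMB * HARD_RELATIVE);
  if (diff > hardLimit) return 'fail';
  if (diff > softLimit) return 'warn';
  return 'ok';
}

// Small module: the absolute limits dominate (5 MB soft, 50 MB hard).
console.log(classifyDelta(20, 24)); // diff 4 MB
console.log(classifyDelta(20, 26)); // diff 6 MB
// Large module: the relative limits dominate (20 MB soft, 200 MB hard).
console.log(classifyDelta(200, 215)); // diff 15 MB
```

Under this reading, small modules are protected from noisy percentage swings by the absolute floor, while large modules get proportionally more headroom.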

How it works

  1. Each shard extracts MEMPROBE_FILE lines from test output into a JSON artifact
  2. The merge-reports job downloads all 20 shard reports and runs check-memory-baseline.mjs
  3. Results appear in GITHUB_STEP_SUMMARY; hard failures block the PR
  4. On main push, update-memory-baseline.mjs regenerates and commits the baseline
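
The merge-and-compare part of the steps above could look roughly like this. It is a sketch under stated assumptions, not the contents of check-memory-baseline.mjs: each shard report is assumed to be a plain object mapping module name to heap delta in MB, the warmup module is assumed to be excluded from comparison, and both function names are hypothetical.

```javascript
const has = (obj, key) => Object.prototype.hasOwnProperty.call(obj, key);

// Hypothetical: combine per-shard reports ({ moduleName: deltaMB }) into one map.
function mergeShardReports(shardReports) {
  const merged = {};
  for (const report of shardReports) {
    for (const [mod, deltaMB] of Object.entries(report)) {
      // The warmup module runs on every shard, so it legitimately repeats.
      // Skipping it here is an assumption of this sketch.
      if (mod === '__shard_warmup__') continue;
      if (has(merged, mod)) {
        throw new Error(`Module reported by more than one shard: ${mod}`);
      }
      merged[mod] = deltaMB;
    }
  }
  return merged;
}

// Hypothetical: pair each current module with its baseline entry and list
// modules that are new or no longer present.
function diffAgainstBaseline(baseline, merged) {
  const rows = [];
  const added = [];
  for (const [mod, current] of Object.entries(merged)) {
    if (has(baseline, mod)) {
      rows.push({ mod, baseline: baseline[mod], current, diff: current - baseline[mod] });
    } else {
      added.push(mod);
    }
  }
  const missing = Object.keys(baseline).filter((mod) => !has(merged, mod));
  return { rows, added, missing };
}

// Made-up data for illustration only.
const exampleMerged = mergeShardReports([
  { __shard_warmup__: 36, 'Acceptance | a': 12.5 },
  { __shard_warmup__: 35.5, 'Unit | b': 3 },
]);
console.log(diffAgainstBaseline({ 'Acceptance | a': 10, 'Unit | gone': 1 }, exampleMerged));
```

Reporting added and missing modules separately keeps a renamed or deleted test file from being mistaken for a regression.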

Noise validation

Ran 3 identical CI runs on the same SHA to measure reproducibility:

  • Median per-module delta spread: 0.0 MB across all size buckets
  • p90 spread: 0.7 MB; max spread: 39.6 MB (one outlier)
  • Sharding is 100% deterministic (same modules, same order, all 3 runs)
  • A 10% threshold produces essentially zero false positives on unchanged code
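
For reference, spread figures like those above can be computed along these lines. This is a sketch only: the input shape (one `{ module: deltaMB }` object per run) and the nearest-rank quantile are assumptions, and the PR does not say how its numbers were actually derived.

```javascript
// Nearest-rank quantile on an ascending array (one common definition;
// the method behind the PR's figures is not stated).
function quantile(sortedAscending, q) {
  const rank = Math.max(1, Math.ceil(q * sortedAscending.length));
  return sortedAscending[rank - 1];
}

// Hypothetical: `runs` holds one { moduleName: deltaMB } object per CI run on
// the same SHA. A module's spread is max minus min across those runs.
function spreadStats(runs) {
  const spreads = Object.keys(runs[0])
    .map((mod) => {
      const values = runs.map((run) => run[mod]);
      return Math.max(...values) - Math.min(...values);
    })
    .sort((a, b) => a - b);
  return {
    median: quantile(spreads, 0.5),
    p90: quantile(spreads, 0.9),
    max: spreads[spreads.length - 1],
  };
}

// Made-up numbers for illustration only.
console.log(
  spreadStats([
    { a: 1, b: 10, c: 5 },
    { a: 1, b: 12, c: 5 },
    { a: 1, b: 11, c: 6 },
  ]),
);
```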

New files

  • packages/host/scripts/extract-memory-report.mjs — per-shard log parser
  • packages/host/scripts/check-memory-baseline.mjs — baseline comparison
  • packages/host/scripts/update-memory-baseline.mjs — baseline regeneration
  • packages/host/memory-baseline.json — initial baseline (170 modules, median of 3 runs)
  • packages/host/tests/helpers/shard-warmup.ts — synthetic warmup module
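
A baseline built as the "median of 3 runs" could be produced roughly as follows. The function names and the `{ module: deltaMB }` input shape are assumptions for illustration, not the contents of update-memory-baseline.mjs.

```javascript
function median(values) {
  const sorted = [...values].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 1 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
}

// Hypothetical: `runs` holds one { moduleName: deltaMB } object per CI run.
function buildBaseline(runs) {
  const samples = new Map();
  for (const run of runs) {
    for (const [mod, deltaMB] of Object.entries(run)) {
      if (!samples.has(mod)) samples.set(mod, []);
      samples.get(mod).push(deltaMB);
    }
  }
  const baseline = {};
  // Sorted keys keep the committed JSON stable, so diffs show only real changes.
  for (const mod of [...samples.keys()].sort()) {
    baseline[mod] = median(samples.get(mod));
  }
  return baseline;
}

// Made-up numbers for illustration only.
console.log(JSON.stringify(buildBaseline([{ b: 3, a: 10 }, { a: 12, b: 3 }, { a: 11, b: 9 }]), null, 2));
```

Taking the median rather than the mean keeps a single outlier run from shifting a module's baseline.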

Test plan

  • Validated full pipeline via workflow_dispatch run 24538269499
  • All 20 memory report artifacts uploaded
  • Check step correctly identified 1 soft warning, 0 hard failures
  • Baseline update correctly skipped (not a main push)
  • Verify baseline auto-update works on merge to main

🤖 Generated with Claude Code

ylm and others added 2 commits April 16, 2026 18:52
- Add per-module memory probes via QUnit suiteStart/suiteEnd in setup-qunit.js
  that log heap delta for each top-level test module (MEMPROBE_FILE lines).
  Uses double-GC at module boundaries for accurate snapshots.
- Add __shard_warmup__ synthetic module (shard-warmup.ts) that runs first on
  every shard to absorb shared boot cost (~36MB), giving real test modules
  clean per-file deltas independent of shard position.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Per-shard memory reports are extracted from MEMPROBE_FILE test output and
uploaded as artifacts. The merge-reports job aggregates them and compares
against a committed baseline (packages/host/memory-baseline.json).

Tiered thresholds:
  - Warn: >10% increase or +5MB (whichever is greater)
  - Fail: >100% increase or +50MB (whichever is greater)

On main merge (all tests green), the baseline auto-updates so it tracks
the current state. On PRs, regressions are flagged in GITHUB_STEP_SUMMARY.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@github-actions

Preview deployments

Contributor

Copilot AI left a comment

Pull request overview

Adds a CI-level per-module memory regression check for Host tests by emitting per-module heap deltas during QUnit runs, aggregating shard reports, and comparing results against a committed baseline to warn/fail on regressions.

Changes:

  • Add a synthetic __shard_warmup__ test module to absorb shard boot cost before real modules run.
  • Add QUnit suiteStart/suiteEnd hooks to log per-top-level-module heap usage deltas (MEMPROBE_FILE ...).
  • Add CI steps and Node scripts to extract per-shard reports, compare to memory-baseline.json, and (on main) auto-update the baseline.
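
As a rough illustration of the suiteStart/suiteEnd probe described above, and not the actual code in setup-qunit.js: the heap reader, GC trigger, and logger are injected, the log line format is a guess based on the PR text (a quoted `module=` value plus heap usage and delta), and `installMemoryProbe` is a hypothetical name. It assumes suiteStart and suiteEnd fire in properly nested pairs, so that depth 1 corresponds to a top-level module.

```javascript
// Sketch only: readHeapMB, collectGarbage, log, and the output format are
// assumptions, not the PR's implementation.
function installMemoryProbe(qunit, { readHeapMB, collectGarbage, log }) {
  let depth = 0;
  let startMB = 0;

  const snapshot = () => {
    // Two GC passes before reading, mirroring the "double-GC at module
    // boundaries" mentioned in the commit message.
    collectGarbage();
    collectGarbage();
    return readHeapMB();
  };

  qunit.on('suiteStart', () => {
    depth += 1;
    // Only the outermost module opens a measurement window.
    if (depth === 1) startMB = snapshot();
  });

  qunit.on('suiteEnd', (suite) => {
    if (depth === 1) {
      const endMB = snapshot();
      const deltaMB = endMB - startMB;
      log(
        `MEMPROBE_FILE module=${JSON.stringify(suite.name)} ` +
          `heap=${endMB.toFixed(1)}MB delta=${deltaMB.toFixed(1)}MB`,
      );
    }
    depth -= 1;
  });
}
```

In a Chrome-based test run, `readHeapMB` might wrap the non-standard `performance.memory.usedJSHeapSize`, and `collectGarbage` might call `globalThis.gc`, which is only available when the browser is started with `--js-flags=--expose-gc`. Both are assumptions about the environment.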

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 3 comments.

| File | Description |
|------|-------------|
| packages/host/tests/test-helper.js | Side-effect import to register the warmup QUnit module before test partitions load. |
| packages/host/tests/helpers/shard-warmup.ts | New synthetic warmup module that primes common runtime/setup work. |
| packages/host/tests/helpers/setup-qunit.js | Adds per-top-level-module heap delta logging via QUnit suite hooks. |
| packages/host/scripts/extract-memory-report.mjs | Parses test output and emits a per-shard JSON memory report artifact. |
| packages/host/scripts/check-memory-baseline.mjs | Compares merged reports against a committed baseline and emits CI summary + exit code. |
| packages/host/scripts/update-memory-baseline.mjs | Regenerates memory-baseline.json from shard reports (for main-branch updates). |
| packages/host/memory-baseline.json | Introduces initial committed baseline for per-module memory deltas. |
| .github/workflows/ci-host.yaml | Uploads memory artifacts per shard, merges them, checks baseline, and updates baseline on main. |

Comment on lines +24 to 26
contents: write
id-token: write
pull-requests: write
Copilot AI Apr 17, 2026

Workflow-level permissions grants contents: write for all events (including pull_request). That unnecessarily broad scope increases the blast radius of any workflow change in a PR. Consider keeping workflow default at contents: read and granting contents: write only at the job (or step) that performs the baseline commit/push, gated to push on main.

Suggested change
- contents: write
- id-token: write
- pull-requests: write
+ contents: read
+ id-token: write
+ pull-requests: read

Copilot uses AI. Check for mistakes.
Comment on lines +21 to +40
const JSON_ENVELOPE_RE = /\{"type":"log","text":"(.*?)"\}\s*$/;

const log = readFileSync(inputPath, 'utf8');
const report = {};

for (const rawLine of log.split('\n')) {
  if (!rawLine.includes('MEMPROBE_FILE')) continue;

  let line = rawLine;

  // Unwrap testem JSON envelope if present
  const envMatch = line.match(JSON_ENVELOPE_RE);
  if (envMatch) {
    try {
      line = JSON.parse(`"${envMatch[1]}"`);
    } catch {
      // fall through to raw parse
    }
  }

Copilot AI Apr 17, 2026

The testem JSON envelope unwrapping is fragile: JSON_ENVELOPE_RE will stop at the first " inside the text field (which is very likely here because MEMPROBE_FILE logs include a quoted module= value), causing the unwrap to fail and the probe line to be skipped. A more robust approach is to JSON.parse(rawLine) when it looks like a JSON object and read .text from { type: "log" }, falling back to the raw line otherwise.

Suggested change
- const JSON_ENVELOPE_RE = /\{"type":"log","text":"(.*?)"\}\s*$/;
- const log = readFileSync(inputPath, 'utf8');
- const report = {};
- for (const rawLine of log.split('\n')) {
-   if (!rawLine.includes('MEMPROBE_FILE')) continue;
-   let line = rawLine;
-   // Unwrap testem JSON envelope if present
-   const envMatch = line.match(JSON_ENVELOPE_RE);
-   if (envMatch) {
-     try {
-       line = JSON.parse(`"${envMatch[1]}"`);
-     } catch {
-       // fall through to raw parse
-     }
-   }
+ function unwrapTestemLogLine(rawLine) {
+   const trimmed = rawLine.trim();
+   if (!trimmed.startsWith('{') || !trimmed.endsWith('}')) {
+     return rawLine;
+   }
+   try {
+     const parsed = JSON.parse(trimmed);
+     if (
+       parsed &&
+       parsed.type === 'log' &&
+       typeof parsed.text === 'string'
+     ) {
+       return parsed.text;
+     }
+   } catch {
+     // fall through to raw parse
+   }
+   return rawLine;
+ }
+ const log = readFileSync(inputPath, 'utf8');
+ const report = {};
+ for (const rawLine of log.split('\n')) {
+   if (!rawLine.includes('MEMPROBE_FILE')) continue;
+   const line = unwrapTestemLogLine(rawLine);

Comment on lines +103 to +116
if (failures.length > 0) {
  lines.push(`### Failures (>${HARD_RELATIVE * 100}% increase or +${HARD_ABSOLUTE_MB}MB)\n`);
  lines.push('| Module | Baseline | Current | Change |');
  lines.push('|--------|----------|---------|--------|');
  for (const f of failures.sort((a, b) => b.diff - a.diff)) {
    lines.push(
      `| ${f.mod} | ${f.baseline.toFixed(1)} MB | ${f.current.toFixed(1)} MB | +${f.diff.toFixed(1)} MB (+${f.pct}%) |`,
    );
  }
  lines.push('');
}

if (warnings.length > 0) {
  lines.push(`### Warnings (>${SOFT_RELATIVE * 100}% + ${SOFT_ABSOLUTE_MB}MB increase)\n`);
Copilot AI Apr 17, 2026

The headings in the step summary output don’t match the implemented thresholds. The code uses Math.max(absolute, relative) (i.e. “whichever is greater”), but the strings currently read like “or” / “10% + 5MB”, which can mislead readers about when a module will actually warn/fail. Update the summary text to reflect the “whichever is greater” behavior (or adjust the threshold logic to match the wording).

@github-actions

github-actions bot commented Apr 17, 2026

Realm Server Test Results

  1 files  ±0    1 suites  ±0   14m 3s ⏱️ +17s
894 tests ±0  894 ✅ ±0  0 💤 ±0  0 ❌ ±0 
966 runs  ±0  966 ✅ ±0  0 💤 ±0  0 ❌ ±0 

Results for commit 3a2bd45. ± Comparison against base commit 9282ace.

♻️ This comment has been updated with latest results.

@github-actions

github-actions bot commented Apr 17, 2026

Host Test Results

    1 files  +    1      1 suites  +1   2h 13m 26s ⏱️ + 2h 13m 26s
2 259 tests +2 259  2 244 ✅ +2 244  15 💤 +15  0 ❌ ±0 
2 278 runs  +2 278  2 263 ✅ +2 263  15 💤 +15  0 ❌ ±0 

Results for commit 3a2bd45. ± Comparison against base commit 9282ace.

♻️ This comment has been updated with latest results.

@backspace
Contributor

> • Add CI pipeline to extract per-shard memory reports, compare against a committed baseline (memory-baseline.json), and flag regressions with tiered thresholds:

I see the baseline file in this PR, but it doesn’t seem like the comparison or flagging is happening. Is this still in progress, or planned for a follow-up?

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>